Our study asks participants to rate the perceived riskiness of various risks. The "Risk Group" decided on a more systematic approach to filtering out the risks worth investigating. Drawing on several sources, such as the Basel Risk Norms and two separate scoping reviews (one from our Risk Polarization group, the other from Amanda), they compiled a list of around 100 risks and labeled them by domain (such as health, finances, politics, crime, and nature). To be as efficient as possible, however, we have to choose which risks are worth asking about in the first place, since some risks are more similar to each other than others.
One way to do this is to ask humans to rate or sort the risks into clusters/domains themselves. But since that also takes time and money, this project leverages embeddings to do the clustering and mapping instead. A huge shout-out to the R package embedR and its author Dirk Wulff: the package makes working with embeddings in R very easy, and his GitHub page generously provides ready-made pipelines.
Here is a list of risks we are working with:
The data has two columns: the domain labels ("Domain/Label") and the risks ("Risk/Items"), which are the targets of the embedding analysis. Using the er_embed() function, we can embed the items. We will use the default all-mpnet-base-v2 model from Hugging Face, a good lightweight model for embedding analyses.
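The embedding step can be sketched as follows. This is a minimal sketch, assuming the data sits in a data frame called `risks` with columns `domain` and `risk` (illustrative names, not from the source); check the embedR documentation for the exact er_embed() arguments and how the model is selected.

```r
# Load embedR (https://github.com/dwulff/embedR)
library(embedR)

# Embed the risk items; all-mpnet-base-v2 is the package's default
# model according to the text above, so no model argument is needed.
emb <- er_embed(risks$risk)
```

Printing the resulting object shows the number of items and the dimensionality of the embedding, as in the output below.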
We can now inspect the embeddings for each risk:
##
## embedR object
##
## Embedding: 99 objects and 768 dimensions.
##
##
## Embedding
##
## [,1] [,2] [,3] [,4] [,5]
## health 0.008797954 0.09673023 0.010030607 -0.01816359 0.03574587
## wildfires -0.025626084 0.06679159 0.017190674 -0.03250840 -0.01195548
## trust government -0.023342272 0.14670171 -0.006135981 0.03572665 0.00765732
## terrorism -0.014037776 0.02034166 0.004858475 -0.03388666 -0.06702958
## unemployment -0.017772257 0.07592832 0.007619646 -0.03449840 -0.01949257
Reducing the dimensions to two helps us visualize the embeddings. The following plot is colored by the "Team Risks" domain labels. Most items sit well within their respective labels, but there are still outliers.
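The 2-D reduction could be done along these lines. This is a sketch, assuming `emb` is the 99 × 768 embedding matrix and `domains` is the vector of domain labels (both names are illustrative); the umap package is one common choice, though embedR may provide its own projection helper.

```r
library(umap)

# Project the 768-dimensional embeddings down to 2 dimensions
proj <- umap(emb)
xy   <- proj$layout  # one 2-D coordinate pair per risk

# Scatter plot colored by domain label
plot(xy, col = factor(domains), pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2")
```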
Instead of comparing each risk to the others visually, we can calculate the cosine similarity, giving us a numerical representation of how similar they are to each other. This also helps us choose which risks to keep, since we don't want ones that are too similar to each other, as they would become redundant.
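Cosine similarity needs no special package: normalize each embedding vector to unit length and take the matrix product. A small self-contained version in base R (the function name `cosine_matrix` is illustrative):

```r
# Cosine similarity between the rows of an embedding matrix.
# Each row is L2-normalized first; the cross-product of the
# normalized matrix with its transpose then yields the cosines.
cosine_matrix <- function(x) {
  x_norm <- x / sqrt(rowSums(x^2))
  x_norm %*% t(x_norm)
}

# Toy check with three 2-D vectors: orthogonal rows give 0,
# identical directions give 1.
m <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))
round(cosine_matrix(m), 2)
```

Applied to the full embedding matrix, this returns a 99 × 99 similarity matrix, one entry per pair of risks.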
One could ask whether these five domains accurately describe our risks. Luckily, thanks to the embeddings, we now have a similarity matrix, which makes k-means clustering available. The following plots are generated with differing numbers of clusters. As a rule of thumb, the higher the F-statistic (the ratio of the between-cluster to the within-cluster sum of squares), the better.
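The cluster comparison could be sketched as below. This assumes `emb` is the embedding matrix, and interprets the "F-statistic" above as the pseudo-F (Calinski–Harabasz) index, i.e. the between/within sum-of-squares ratio scaled by the degrees of freedom; the helper name `pseudo_f` is illustrative.

```r
# Pseudo-F index for a k-means solution: higher values indicate
# tighter, better-separated clusters.
pseudo_f <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 25)
  n  <- nrow(x)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# Compare solutions across a range of cluster counts
sapply(2:8, function(k) pseudo_f(emb, k))
```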
WIP: more clusters?
The F-values are very similar across all cluster counts…
Is the overlap due to the 768 dimensions? This plot uses a reduction to only 2 of them, where information gets lost…